Cp bench #5

Draft
sudhakarsingh27 wants to merge 3 commits into cp_thd_swa_with_ag from cp_bench
Conversation

@sudhakarsingh27
Owner

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Move benchmarking infrastructure for CP attention onto a dedicated branch so
it persists outside of stash. The core test suite (test_attention_with_cp.py)
stays focused on correctness; this branch layers benchmark/profile/stress
configs and a cross-backend consistency check on top.

run_attention_with_cp.py changes (worker side):
- thd_seqlen_pattern arg supports max/half/linear/alternating/random and
  explicit comma-separated lengths, so benchmark configs can pin a
  specific variable-length workload instead of randomizing per-run.
- benchmark arg drives a 10-warmup + N-iter timing loop wrapped in
  cudaProfilerStart/Stop and prints ms/iter for nsys/ncu workflows.
- torch.manual_seed(1234) for reproducibility across runs.
- Setting the CP_CROSS_BACKEND_SAVE_DIR env var saves per-rank inputs/outputs
  as .pt files, so the cross-backend consistency test can compare them
  without re-running.
- Soft import from benchmark_cp so the worker can resolve names like
  cp_thd_0, bench_8k, bariamis_8k, rl16k without test_attention_with_cp.py
  needing to know about them.
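
As a rough illustration of the thd_seqlen_pattern idea above, here is a minimal pure-Python sketch of how a named pattern could be expanded into per-sequence lengths. The helper names (make_seqlens, cu_seqlens) and the exact meaning given to each pattern (for example, that "linear" means evenly spaced lengths ending at the maximum) are assumptions for illustration only; the actual logic in run_attention_with_cp.py may differ.

```python
import random


def make_seqlens(pattern, batch_size, max_seqlen, seed=1234):
    """Hypothetical helper: expand a pattern name into one length per sequence."""
    if pattern == "max":
        return [max_seqlen] * batch_size
    if pattern == "half":
        return [max_seqlen // 2] * batch_size
    if pattern == "linear":
        # Assumed meaning: evenly spaced lengths ending at max_seqlen.
        return [max(1, max_seqlen * (i + 1) // batch_size) for i in range(batch_size)]
    if pattern == "alternating":
        # Assumed meaning: full length on even indices, half length on odd ones.
        return [max_seqlen if i % 2 == 0 else max_seqlen // 2 for i in range(batch_size)]
    if pattern == "random":
        # A private, seeded RNG keeps the workload identical across runs.
        rng = random.Random(seed)
        return [rng.randint(1, max_seqlen) for _ in range(batch_size)]
    # Anything else is treated as explicit comma-separated lengths.
    lens = [int(tok) for tok in pattern.split(",")]
    if len(lens) != batch_size:
        raise ValueError(f"expected {batch_size} lengths, got {len(lens)}")
    if any(n < 1 or n > max_seqlen for n in lens):
        raise ValueError("each length must be in [1, max_seqlen]")
    return lens


def cu_seqlens(lens):
    """Cumulative offsets for a packed (THD) layout: [0, l0, l0 + l1, ...]."""
    out = [0]
    for n in lens:
        out.append(out[-1] + n)
    return out
```

Pinning the lengths this way (rather than drawing them fresh on every run) is what lets two benchmark runs of the same config be compared directly.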

benchmark_cp.py (new):
- Stress configs (cp_thd_0..3, cp_thd_swa_0..3) — higher batch/longer
  seqlen than the core suite.
- Llama3-8b-shaped configs (bench_8k/16k/32k).
- Variable-length training-workload configs (rl16k, bucket32k/64k/128k,
  mixed32k, outlier64k) with per-config thd_seqlen_pattern.
- Worker-only configs (bariamis_*, bench_84992/86016) for manual
  invocation against the AG spike investigation log shapes.
- test_cp_thd_cross_backend_consistency: runs each backend
  (p2p/all_gather/a2a) on the same input, saves outputs via
  CP_CROSS_BACKEND_SAVE_DIR, and asserts pairwise agreement
  within atol=0.1.
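
The pairwise agreement check described above can be sketched as follows. This is a stand-in that operates on flat lists of floats so it runs without a GPU; the real test loads the saved .pt tensors and presumably uses a tensor comparison instead. The function names and the dict-of-lists input shape are assumptions; only the three backend names and atol=0.1 come from the commit message.

```python
from itertools import combinations

ATOL = 0.1  # tolerance quoted in the commit message


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two equal-length lists."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def failing_pairs(outputs, atol=ATOL):
    """Hypothetical helper: compare every pair of backends.

    `outputs` maps a backend name to its flattened output values.
    Returns a list of (backend_a, backend_b, max_diff) for pairs that disagree.
    """
    failures = []
    for a, b in combinations(sorted(outputs), 2):
        diff = max_abs_diff(outputs[a], outputs[b])
        if diff > atol:
            failures.append((a, b, diff))
    return failures


if __name__ == "__main__":
    outputs = {
        "p2p": [1.0, 2.0],
        "all_gather": [1.05, 2.0],
        "a2a": [1.0, 2.0],
    }
    # Three backends give three pairs; all are within atol here.
    assert failing_pairs(outputs) == []
```

Comparing all pairs, rather than each backend against a single reference, means a regression in any one backend shows up as two failing pairs that share its name.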

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

Add 18 SWA training workload configs (6 real workloads × 3 windows)
to benchmark_cp.py for benchmarking sliding-window attention with
context parallelism. Replace the old single-GPU FusedAttn vs FlashAttn
benchmark script with a README documenting full benchmark results
(full causal + SWA, cp=2/4/8, p2p/all_gather/a2a) and individual
config runner usage.
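
To make the 6 × 3 arithmetic concrete, here is a small sketch that enumerates such a config matrix. It assumes the six workloads are the variable-length training configs named in this PR (rl16k, bucket32k/64k/128k, mixed32k, outlier64k) and that the three windows are 512, 1024 and 2048; the `<workload>_swa<window>` naming scheme is hypothetical and may not match the names used in benchmark_cp.py.

```python
from itertools import product

# Assumed to be the six "real workloads"; taken from the config names in this PR.
WORKLOADS = ["rl16k", "bucket32k", "bucket64k", "bucket128k", "mixed32k", "outlier64k"]
WINDOWS = [512, 1024, 2048]            # sliding-window sizes
CP_SIZES = [2, 4, 8]                   # context-parallel sizes covered by the README
BACKENDS = ["p2p", "all_gather", "a2a"]


def swa_config_names(workloads=WORKLOADS, windows=WINDOWS):
    """Hypothetical naming: one config per (workload, window) pair."""
    return [f"{w}_swa{win}" for w, win in product(workloads, windows)]


def run_matrix(configs, cp_sizes=CP_SIZES, backends=BACKENDS):
    """Every (config, cp_size, backend) combination to benchmark."""
    return list(product(configs, cp_sizes, backends))


if __name__ == "__main__":
    names = swa_config_names()
    print(len(names))               # 6 workloads x 3 windows
    print(len(run_matrix(names)))   # times 3 cp sizes x 3 backends
```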

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>

Re-ran all 6 real-training configs (full causal + SWA{512,1024,2048}) on a
second 8x H100 node with cuDNN 9.21 / NCCL 2.29.7 and replaced the prior
results tables. cp=2 was re-run serially because 4-wide concurrency on a single
node distorted a2a SWA timings ~2x and triggered intermittent
cudaErrorIllegalInstruction on AG SWA configs.

The bucket128k SWA AG cp>=4 'FAIL' matrix seen on the original node does not
reproduce on the new node, but a smaller intermittent-crash failure mode
(cp=2 SWA AG under heavy concurrency) was observed; it is documented as a
known issue with the serial-run workaround.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>